Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm
Abstract
This article presents the results of a multidisciplinary project aimed at better understanding how different digitization strategies affect computational text analysis. Specifically, it describes an effort to automatically distinguish the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), and reports on the effect this noise has on the analyses needed to computationally identify the distinct writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least for authorship attribution. Our results suggest that attribution is viable even when the training and test sets come from different digitization pipelines. With regard to HTR, this research demonstrates that, although automated transcription significantly increases the risk of text misclassification compared to OCR, a cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.
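To make the idea of binary attribution on noisy transcriptions concrete, here is a minimal sketch. It is not the method used in the article, which the abstract does not specify. It assumes a deliberately simple stand-in: character trigram profiles compared by cosine similarity. The noise model in `add_noise` is also an assumption, treating "cleanliness" as the probability that a character survives unchanged. All function names and the toy training texts are hypothetical.

```python
import math
import random
from collections import Counter


def add_noise(text, cleanliness, seed=0):
    """Keep each character with probability `cleanliness`; otherwise replace
    it with a random lowercase letter. This is an assumed, crude stand-in
    for OCR/HTR recognition errors, not the article's definition."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        ch if rng.random() < cleanliness else rng.choice(alphabet)
        for ch in text
    )


def profile(text, n=3):
    """Relative frequencies of the character n-grams in `text`."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(w * q.get(gram, 0.0) for gram, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def attribute(text, author_profiles, n=3):
    """Return the author whose profile is most similar to `text`."""
    test_profile = profile(text, n)
    return max(
        author_profiles,
        key=lambda author: cosine(test_profile, author_profiles[author]),
    )


# Toy training data (invented for illustration, not taken from the letters).
training = {
    "Jacob": "die deutsche grammatik und die geschichte der deutschen sprache ",
    "Wilhelm": "es war einmal ein koenig der hatte eine schoene tochter ",
}
author_profiles = {name: profile(text) for name, text in training.items()}

letter = "die geschichte der deutschen grammatik"
for level in (1.0, 0.6, 0.2):
    noisy = add_noise(letter, level, seed=42)
    print(level, repr(noisy), attribute(noisy, author_profiles))
```

Running the loop at several cleanliness levels gives a rough feel for how attribution degrades as more characters are corrupted. A real experiment would need many documents per author and a properly evaluated classifier.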
Similar resources
A survey of modern authorship attribution methods
Authorship attribution supported by statistical or computational methods has a long history, starting in the 19th century and marked by the seminal study of Mosteller and Wallace (1964) on the authorship of the disputed Federalist Papers. During the last decade, this scientific field has developed substantially, taking advantage of research advances in areas such as machine learning, informati...
Stylometric Analysis of Early Modern Period English Plays
Function word adjacency networks (WANs) are used to study the authorship of plays from the Early Modern English period. In these networks, nodes are function words and directed edges between two nodes represent the likelihood of ordered co-appearance of the two words. For every analyzed play a WAN is constructed and these are aggregated to generate author profile networks. We first study the si...
Creutzfeldt-Jacob Disease in an Iranian Patient Confirmed By Brain Autopsy
Creutzfeldt-Jacob disease is the most common form of prion disease. Prion diseases have become a public health problem in the last two decades because of the high number of reported cases of mad cow disease in Great Britain and other countries. Creutzfeldt-Jacob disease is a fatal condition with known cardinal clinical features, including progressive memory loss and myoclonic seizure disorder. In this re...
“Editorial Letter” Managing Authorship Conflicts: A Guide for Researchers
The issue of authorship has become increasingly prominent in the academic world and has caused many problems for scholars and the editorial boards of scientific journals. Given the increasing number of postgraduate students and subsequent research activities, every day we witness formal and informal arguments over the ambiguity and unfairness of decisions regarding authors' names an...
Explaining Delta, or: How do distance measures for authorship attribution work?
Authorship Attribution is a research area in quantitative text analysis concerned with attributing texts of unknown or disputed authorship to their actual author based on quantitatively measured linguistic evidence (see Juola 2006; Stamatatos 2009; Koppel et al. 2009). Authorship attribution has applications in literary studies, history, forensics and many other fields, e.g. corpus stylistics (...